Machine Learning / Aprendizagem Automática¶

Sara C. Madeira, 2024/25¶

AA Project - Learning about Pet Adoption using PetFinder.my Dataset¶

Created by¶

  • Filipe Marques nº57500
  • Jessica Soares nº43356
  • Leonor Fandinga nº64481

Logistics¶

Read Carefully

Students are encouraged to work in teams of 3 people.

Projects with smaller teams are allowed, in exceptional cases, but will not have better grades for this reason.

The quality of the project will dictate its grade, not the number of people working.

The project's solution should be uploaded in Moodle before the end of December 22nd, 2024 (the last day before the Christmas holidays).

Teams should upload a .zip file containing all the files necessary for project evaluation. Teams should be registered in Moodle and the zip file, uploaded by one of the group members, should be identified as AA202425nn.zip, where nn is the group number.

It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use AA_202425_Project.ipynb as a template. In your .zip folder you should also include an HTML version of your notebook with all the outputs.

Decisions should be justified and results should be critically discussed.

Remember that your notebook should be as clear and organized as possible, that is, only the relevant code and experiments should be presented, not everything you tried and did not work (that can be discussed in the text, if relevant)!

Project solutions containing only code and outputs without discussions will achieve a maximum grade of 10 out of 20.

Tools¶

The team should use Python 3 and Jupyter Notebook, together with Scikit-learn, Orange3, or both.

Orange3 can be used through its programmatic version, by importing and using its packages as done with Scikit-learn, or through its workflow version.

It is up to the team to decide when to use Scikit-learn, Orange, or both.

In this context, your Jupyter notebook might have a mix of code, results, text explanations, workflow figures, etc.

In case you use Orange/workflows for some tasks you should also deliver the workflow files. Your notebook should include figures for the workflows used, together with an overall explanation and specific descriptions of the options taken in each of their widgets.

You can use this notebook and the sections below as example.

Dataset¶

The dataset to be analysed is PetFinder_dataset.csv, made available together with this project description. This dataset, downloaded from Kaggle, contains selected and modified data from the following competition: PetFinder.my Adoption Prediction.

PetFinder.my has been Malaysia's leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare. Animal adoption rates are strongly correlated with the metadata associated with their online profiles, such as descriptive text and photo characteristics. As one example, PetFinder is currently experimenting with a simple AI tool called the Cuteness Meter, which ranks how cute a pet is based on qualities present in its photos.

In this competition, data scientists are supposed to develop machine learning approaches to predict the adoptability of pets, specifically, how quickly a pet is adopted. If successful, these approaches will be adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles' appeal, reducing animal suffering and euthanization.

In this project, your team is supposed to use only tabular data (not Images or Image Metadata) and see how far you can go in predicting and understanding PetFinder.my adoptions. You should use both supervised and unsupervised learning to tackle 2 tasks:

  1. Task 1 (Supervised Learning) - Predicting Adoption and Adoption Speed
  2. Task 2 (Unsupervised Learning) - Characterizing Pets and their Adoption Speed

The PetFinder_dataset.csv your machine learning algorithms should learn from has 14,993 instances described by 24 data fields, which you might use as categorical/numerical features, and corresponds to a modified version of the train.csv file made available for the competition (https://www.kaggle.com/c/petfinder-adoption-prediction/data). The target in the original Kaggle challenge is AdoptionSpeed.

File Descriptions¶

  • PetFinder_dataset.csv - Tabular/text data for machine learning.
  • breed_labels.csv - Contains Type and BreedName for each BreedID. Type 1 is dog, 2 is cat.
  • color_labels.csv - Contains ColorName for each ColorID.
  • state_labels.csv - Contains StateName for each StateID.

Data Fields¶

  • PetID - Unique hash ID of pet profile
  • Type - Type of animal (1 = Dog, 2 = Cat)
  • AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict in the competition. See section below for more info.
  • Name - Name of pet (Empty if not named)
  • Age - Age of pet when listed, in months
  • Breed1 - Primary breed of pet (see BreedLabels.csv for details)
  • Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
  • Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
  • Color1 - Color 1 of pet (see ColorLabel.csv for details)
  • Color2 - Color 2 of pet (see ColorLabel.csv for details)
  • Color3 - Color 3 of pet (see ColorLabel.csv for details)
  • MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
  • FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
  • Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
  • Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
  • Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
  • Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
  • Quantity - Number of pets represented in profile
  • Fee - Adoption fee (0 = Free)
  • State - State location in Malaysia (Refer to StateLabels dictionary)
  • RescuerID - Unique hash ID of rescuer
  • VideoAmt - Total uploaded videos for this pet
  • PhotoAmt - Total uploaded photos for this pet
  • Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

AdoptionSpeed¶

The value of AdoptionSpeed describes how quickly, if at all, a pet is adopted:

  • 0 - Pet was adopted on the same day as it was listed.
  • 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
  • 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
  • 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
  • 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

Important Notes on Data Cleaning and Preprocessing¶

  1. Data can contain errors/typos, whose correction might improve the analysis.
  2. Some features can contain many values, whose grouping in categories (aggregation into bins) might improve the analysis.
  3. Data can contain missing values, that you might decide to fill. You might also decide to eliminate instances/features with high percentages of missing values.
  4. Not all features are necessarily important for the analysis.
  5. Depending on the analysis, some features might have to be excluded.
  6. Class distribution is an important characteristic of the dataset that should be carefully taken into consideration. Class imbalance might impair machine learning.
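As an illustration of notes 2, 3 and 6 above, the following pandas sketch (on a tiny synthetic frame, not the actual dataset) shows the kind of checks that motivate these decisions; the bin edges and labels are illustrative assumptions:

```python
import pandas as pd

# Tiny synthetic sample mimicking a few PetFinder-style columns
df = pd.DataFrame({
    "Age": [2, 3, 60, None, 1, 120],      # months; one missing value
    "Fee": [0, 0, 50, 0, 0, 100],
    "AdoptionSpeed": [1, 2, 4, 4, 0, 4],
})

# Note 3: missing values per feature
print(df.isna().sum())

# Note 2: aggregating a many-valued feature into bins (age in months -> life stage)
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 12, 72, 300],
                        labels=["young", "adult", "senior"])

# Note 6: class distribution of the target
print(df["AdoptionSpeed"].value_counts(normalize=True))
```

On the real dataset, the same three calls reveal which features may need filling, binning, or rebalancing.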

Some potentially useful links:

  • Data Cleaning and Preprocessing in Scikit-learn: https://scikit-learn.org/stable/modules/preprocessing.html#
  • Data Cleaning and Preprocessing in Orange: https://docs.biolab.si//3/visual-programming/widgets/data/preprocess.html
  • Dealing with imbalance datasets: https://pypi.org/project/imbalanced-learn/ and https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets#t7

Task 0 (Know your Data) - Exploratory Data Analysis¶

In this section we aim to better understand the data - including features and their distribution - and to preprocess it for further use.

0.1. Loading Data¶

In [170]:
import numpy as np
import pandas as pd
from LoadingData import * 
table_X, table_y, features, target_name, df = load_data('PetFinder_dataset.csv')


0.2. Understanding Data¶

The first step in this project was to understand the data and how different variables relate to the target variable, AdoptionSpeed. To do that, some plots were developed.

The first one shows the relation between the type of animal and the adoption speed. We were interested in finding out whether the type of the animal (cat or dog) affects the speed at which it is adopted. From this plot it is possible to observe that dogs are the animals that tend to have a higher adoption speed.

In [174]:
# How the type of the animal influences the adoption speed
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(8, 6))

# Plot cats and dogs separately
for pet_type, color, label in zip([1, 2], ['#377eb8', '#ff7f00'], ['Cat', 'Dog']):
    subset = df[df['Type'] == pet_type]
    counts = subset['AdoptionSpeed'].value_counts().sort_index()
    plt.bar(counts.index - 0.2 if pet_type == 1 else counts.index + 0.2, counts, width=0.4, label=label, color=color)

# Customization
plt.title("AdoptionSpeed vs. Type", fontsize=14)
plt.xlabel("Adoption Speed", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.ylim(0,2000)
plt.xticks([0, 1, 2, 3, 4], ['0', "1", "2", "3", "4"])
plt.legend(title="Type")

plt.tight_layout()
plt.show()
[Figure: AdoptionSpeed vs. Type]

The next plot shows the relation between the vaccination status of the animal and the adoption speed. It is possible to verify that animals that are not vaccinated have a higher adoption speed than those that are vaccinated or whose vaccination status is not sure.

In [176]:
# How the vaccination status of the animal influences the adoption speed
plt.figure(figsize=(8, 6))

# Plot vaccinated, not vaccinated and not sure separately
for pet_vaccination, color, label, x_offset  in zip([1, 2, 3], ['#377eb8', '#ff7f00', '#4daf4a' ], ['Vaccinated', 'Not Vaccinated', 'Not sure'], [-0.2, 0, 0.2]):
    subset = df[df['Vaccinated'] == pet_vaccination]
    counts = subset['AdoptionSpeed'].value_counts().sort_index()
    plt.bar(counts.index + x_offset, counts, width=0.2, label=label, color=color)

# Customization
plt.title("AdoptionSpeed vs. Vaccination", fontsize=14)
plt.xlabel("Adoption Speed", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks([0, 1, 2, 3, 4], ['0', "1", "2", "3", '4'])
plt.legend(title="Vaccination Status")

plt.tight_layout()
plt.show()
[Figure: AdoptionSpeed vs. Vaccination]

Finally, the last plot represents the relation between the gender of the animal and the adoption speed. It is possible to conclude that female animals are quicker to be adopted than the others (Male and Other).

In [178]:
# How the Gender of the animal influences the adoption speed
plt.figure(figsize=(8, 6))

# Plot male, female and other separately
for pet_gender, color, label, x_offset  in zip([1, 2, 3], [ '#377eb8','#f781bf', '#4daf4a'], ['Male', 'Female', 'Other'], [-0.2, 0, 0.2]):
    subset = df[df['Gender'] == pet_gender]
    counts = subset['AdoptionSpeed'].value_counts().sort_index()
    plt.bar(counts.index + x_offset, counts, width=0.2, label=label, color=color)

# Customization
plt.title("AdoptionSpeed vs. Gender", fontsize=14)
plt.xlabel("Adoption Speed", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.ylim(0,2000)
plt.xticks([0, 1, 2, 3, 4], ['0', "1", "2", "3", '4'])
plt.legend(title="Gender")

plt.tight_layout()
plt.show()
[Figure: AdoptionSpeed vs. Gender]

0.3. Preprocessing Data¶

The preprocessing was performed in the module "LoadingData.py". The following modifications were made:

  • Removed Description
  • Removed Name
  • Removed PetID
  • Removed RescuerID
  • Removed Breed2
  • Removed Color3
  • Removed VideoAmt

Through extensive analysis, it was observed that certain features had missing values in some rows or contained numerous zeros that did not provide meaningful information.

Description and Name were removed because they have many missing values and are purely textual information. The IDs were removed because they are always or almost always unique, providing little value. Breed2, Color3, and VideoAmt were removed because they contained many zeros.

In supervised learning to predict adoption, the target "AdoptionSpeed" was transformed into a binary feature: 1 if the animal was adopted, that is, if "AdoptionSpeed" was equal to 0, 1, 2 or 3, and 0 if it was not adopted ("AdoptionSpeed" equal to 4).
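The binarisation described above can be sketched in pandas as follows (the project's actual implementation lives in LoadingData.py; this is only an illustration):

```python
import pandas as pd

df = pd.DataFrame({"AdoptionSpeed": [0, 1, 2, 3, 4, 4, 2]})

# 1 = adopted (AdoptionSpeed 0-3), 0 = not adopted (AdoptionSpeed 4)
df["Adopted"] = (df["AdoptionSpeed"] < 4).astype(int)

print(df["Adopted"].tolist())  # [1, 1, 1, 1, 0, 0, 1]
```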

For a better understanding of the data and of how the models would adjust to each type of animal, we divided the dataframe into two dataframes according to the type of animal. This can also be found in the module "LoadingData.py".

Task 1 (Supervised Learning) - Predicting Adoption and Adoption Speed¶

In this task we will be performing 3 classification tasks:

Predicting Adoption (binary classification): a new target "Adopted" was created that considers the pet adopted if AdoptionSpeed is between 0 and 3 and not adopted if AdoptionSpeed is 4. These outcomes were encoded as 1 (adopted) and 0 (not adopted).

Predicting AdoptionSpeed (multiclass classification): the original target "AdoptionSpeed" was used, whose values are in the set {0, 1, 2, 3, 4} (5 classes).

Train specialized models for cats and dogs: The aim of this classification task was to check whether the classification performance improves when Predicting Adoption and Predicting AdoptionSpeed using a model that was trained with only cat/dog instances.

1.1. Specific Data Preprocessing for Classification¶

In [183]:
from LoadingData import * 
from smote import *

table_X, table_y_Adopted, features_Adopted, target_name_Adopted, df_Adopted = loadDataAdopted(df)

# Split by animal type (as implemented in LoadingData.py)
table_X_Dogs, table_y_Dogs_Speed, features_Dogs, target_Name_Dogs, df_Dogs = loadDataAnimalType(df,2)
table_X_Cats, table_y_Cats_Speed, features_Cats, target_Name_Cats, df_Cats = loadDataAnimalType(df,1)

# Binary "Adopted" targets for each animal type
table_X_Cats_Adopted, table_y_Cats_Adopted, features_Cats_Adopted, target_Name_Cats_Adopted, df_Cats_Adopted = loadDataAdopted(df_Cats)
table_X_Dogs_Adopted, table_y_Dogs_Adopted, features_Dogs_Adopted, target_Name_Dogs_Adopted, df_Dogs_Adopted = loadDataAdopted(df_Dogs)

X_smote, y_smote,df1 = smoteadopted(table_X,table_y_Adopted,features_Adopted)

1.2. Learning and Evaluating Classifiers¶

All models are in a file called "Models.py" to facilitate and better organize this work.

To deal with the data imbalance we oversampled the minority class using the Synthetic Minority Oversampling Technique (SMOTE). This technique consists of creating synthetic samples of the minority class to balance the dataset, in order to improve the performance of models that may otherwise be biased toward the majority class.

Undersampling of the majority class was avoided so as not to lose relevant data. The minority class also has enough samples that we are confident it is representative and likely to generate new samples that will not be overfitted.

SMOTE was applied to KNN and Logistic Regression since they are known to be sensitive to class imbalance. Random Forest, Decision Trees and SVM are not as sensitive and therefore did not have SMOTE applied.
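The core idea behind SMOTE can be illustrated with scikit-learn primitives: each synthetic sample is a random interpolation between a minority instance and one of its minority-class nearest neighbours. This is a simplified sketch of the technique, not the imbalanced-learn implementation used in the project:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Minority-class points only (2-D toy data)
X_min = rng.normal(size=(20, 2))

# For each minority point, find its k nearest minority-class neighbours
k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbour
_, idx = nn.kneighbors(X_min)

# One synthetic point per minority instance: interpolate towards a random neighbour
synthetic = []
for i in range(len(X_min)):
    j = rng.choice(idx[i][1:])    # random neighbour, skipping the point itself
    gap = rng.random()            # interpolation factor in [0, 1)
    synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
synthetic = np.asarray(synthetic)

print(synthetic.shape)  # (20, 2): the minority class doubled in size
```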

Cross-validation was another technique used to evaluate the performance of the models and help ensure that they generalize well to unseen data, avoiding overfitting and underfitting. The data was split into 10 folds, so that each model could be exposed to different test and training sets. The cross-validation accuracy is the average of the values obtained over the 10 folds.
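The 10-fold procedure can be sketched with scikit-learn as below (synthetic data and an illustrative classifier; the project's own helpers live in Models.py):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the PetFinder table
X, y = make_classification(n_samples=500, weights=[0.75, 0.25], random_state=0)

# 10 stratified folds preserve the class distribution in each split
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```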

1.2.1 Predicting Adoption¶

To predict the adoption of the animal we selected 5 models: Decision Tree, Naive Bayes, K-Nearest Neighbors (KNN), Logistic Regression and Random Forest.

Accuracy was chosen as the metric since the dataset is only mildly imbalanced (roughly 75-25), so the results will still be representative for a significant portion of the dataset. Moreover, as this relates to a pet adoption problem whose solution would potentially be used in shelters, we considered that the impact of false negatives or false positives is of little consequence.

Decision Tree¶

The Decision Tree was the first one to be implemented.

Below can be observed a representation of the tree model, the accuracies obtained for both the train and test sets, and the accuracy obtained after performing cross-validation.

For this tree, a maximum of 10 leaves was selected, as increasing the number further led to overfitting: beyond 10 leaves, accuracy did not improve. By limiting the leaves, we created a faster and more efficient classifier without compromising performance.
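Capping the number of leaves corresponds to the `max_leaf_nodes` parameter of scikit-learn's `DecisionTreeClassifier`. A minimal sketch on synthetic data (the project's actual call is in Models.py):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Capping the tree at 10 leaves keeps it small and curbs overfitting
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_tr, y_tr)

print("train:", tree.score(X_tr, y_tr))
print("test: ", tree.score(X_te, y_te))
print("leaves:", tree.get_n_leaves())
```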

In [190]:
from Models import * 
%matplotlib inline
OurTree(table_X,table_y_Adopted,10,features_Adopted)
Accuracy on training set: 0.7556229731143425
Accuracy on test set: 0.7514904298713524
Fold:  1, Class dist.: [2980 8491], Acc: 0.751
Fold:  2, Class dist.: [2980 8491], Acc: 0.758
Fold:  3, Class dist.: [2980 8491], Acc: 0.756
Fold:  4, Class dist.: [2980 8491], Acc: 0.751
Fold:  5, Class dist.: [2980 8491], Acc: 0.758
Fold:  6, Class dist.: [2979 8492], Acc: 0.769
Fold:  7, Class dist.: [2980 8492], Acc: 0.753
Fold:  8, Class dist.: [2980 8492], Acc: 0.757
Fold:  9, Class dist.: [2980 8492], Acc: 0.759
Fold: 10, Class dist.: [2980 8492], Acc: 0.739

CV accuracy: 0.755 +/- 0.007
[Figure: decision tree visualization]

The training and test accuracies show similar values, which indicates that the model is generalizing well to data it has not seen before. We consider an accuracy of 75.5% to be a good performance.

Naive Bayes¶

Naive Bayes was the probabilistic algorithm used. The smote method was not used in this model either, as it did not bring any advantage.

Below can be observed two representations: the confusion matrix obtained from the train set and the confusion matrix obtained from the test set. The accuracies obtained for both the train and test sets and the accuracy obtained after performing cross-validation are also shown.

In [194]:
from Models import * 
%matplotlib inline
naive(table_X,table_y_Adopted)
Accuracy on training set: 0.7192174913693901
Accuracy on test set: 0.7295262001882649
Fold:  1, Class dist.: [2980 8491], Acc: 0.716
Fold:  2, Class dist.: [2980 8491], Acc: 0.740
Fold:  3, Class dist.: [2980 8491], Acc: 0.706
Fold:  4, Class dist.: [2980 8491], Acc: 0.718
Fold:  5, Class dist.: [2980 8491], Acc: 0.732
Fold:  6, Class dist.: [2979 8492], Acc: 0.711
Fold:  7, Class dist.: [2980 8492], Acc: 0.710
Fold:  8, Class dist.: [2980 8492], Acc: 0.731
Fold:  9, Class dist.: [2980 8492], Acc: 0.728
Fold: 10, Class dist.: [2980 8492], Acc: 0.708

CV accuracy: 0.720 +/- 0.011
[Figures: Naive Bayes confusion matrices for the train and test sets]

Overall, this model performs better at predicting animal adoptions than non-adoptions, as demonstrated in the confusion matrix above. This could be attributed to the larger number of animals in the adopted class, which reduces the impact of noise on the predictions. The training and test accuracies show similar values, which indicates that the model is generalizing well to data it has not seen before. Although the accuracy is lower than in the previous model, 72.0% still indicates a good performance.

K-Nearest Neighbors (KNN)¶

The K-Nearest Neighbors (KNN) was the distance-based algorithm used. In this case the smote method improved the accuracy, so it was used.

Below can be observed, once again, the confusion matrices obtained from the train and test sets, together with the accuracies for both sets and the accuracy obtained after performing cross-validation.

We opted to use 3 neighbors for this analysis. While selecting only 1 neighbor could potentially boost overall performance by approximately 5%, we believe this approach would result in less reliable predictions. Using 3 neighbors provides a better trade-off between performance and predictive reliability. As the data shows, the training accuracy when using only 1 neighbor reaches 99%, indicating overfitting.
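The effect described above is easy to reproduce: with 1 neighbor, every training point is its own nearest neighbor, so training accuracy is nearly perfect while test accuracy lags behind. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(f"k={k}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```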

In [198]:
from Models import * 

%matplotlib inline
X_smote, y_smote,df1 = smoteadopted(table_X,table_y_Adopted,features_Adopted)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.8818800247371676
Accuracy on test set: 0.7717996289424861
Fold:  1, Class dist.: [6791 6791], Acc: 0.746
Fold:  2, Class dist.: [6791 6791], Acc: 0.728
Fold:  3, Class dist.: [6792 6791], Acc: 0.763
Fold:  4, Class dist.: [6792 6791], Acc: 0.777
Fold:  5, Class dist.: [6792 6791], Acc: 0.793
Fold:  6, Class dist.: [6792 6791], Acc: 0.825
Fold:  7, Class dist.: [6791 6792], Acc: 0.808
Fold:  8, Class dist.: [6791 6792], Acc: 0.807
Fold:  9, Class dist.: [6791 6792], Acc: 0.829
Fold: 10, Class dist.: [6791 6792], Acc: 0.799

CV accuracy: 0.788 +/- 0.032
[Figures: KNN confusion matrices for the train and test sets]

This model predicts non-adopted animals more accurately. While it performs better at identifying non-adopted animals, the number of adopted animals incorrectly predicted as non-adopted remains similar. In comparison to Naive Bayes and the Decision Tree, this model is not generalizing as well to data it has not seen before; however, it shows the highest accuracy so far (78.8%).

Logistic Regression¶

The linear model used was logistic regression, as it is a good model for binary classification. The smote method was also used.

Below can be observed some graphs representing the importance of each feature. This model has a parameter that determines the strength of the regularization, $C$, so it was tested using different values for this parameter (0.001, 1 and 100). The accuracies for the train and test sets in each case, and the accuracy obtained with cross-validation, can also be observed.
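In scikit-learn, `C` is the inverse of the regularization strength (smaller `C` means stronger regularization). A minimal sketch of the comparison on synthetic data (the project's version is in Models.py):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for C in (0.001, 1, 100):  # strong -> weak regularization
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    results[C] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"C={C:<7} train={results[C][0]:.3f}  test={results[C][1]:.3f}")
```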

In [202]:
from Models import * 

logreg(X_smote, y_smote,df1,features_Adopted)
(15092, 17)
Adopted
1.0    7546
0.0    7546
Name: count, dtype: int64
Fold:  1, Class dist.: [6791 6791], Acc: 0.682
Fold:  2, Class dist.: [6791 6791], Acc: 0.638
Fold:  3, Class dist.: [6792 6791], Acc: 0.661
Fold:  4, Class dist.: [6792 6791], Acc: 0.654
Fold:  5, Class dist.: [6792 6791], Acc: 0.661
Fold:  6, Class dist.: [6792 6791], Acc: 0.673
Fold:  7, Class dist.: [6791 6792], Acc: 0.651
Fold:  8, Class dist.: [6791 6792], Acc: 0.644
Fold:  9, Class dist.: [6791 6792], Acc: 0.674
Fold: 10, Class dist.: [6791 6792], Acc: 0.654

CV accuracy: 0.659 +/- 0.013
Train set score (Accuracy)= 0.6609238924649754
Test set score (Accuracy)= 0.6581272084805654
Train set score (Accuracy)= 0.6603559257856872
Test set score (Accuracy)= 0.6576855123674912
Train set score (Accuracy)= 0.659503975766755
Test set score (Accuracy)= 0.6687279151943463
[Figures: logistic regression feature importance plots]
Train accuracy of L1 logreg with C=0.001 = 0.64
Test accuracy of L1 logreg with C=0.001 = 0.64
Train accuracy of L1 logreg with C=1.000 = 0.66
Test accuracy of L1 logreg with C=1.000 = 0.66
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
Train accuracy of L1 logreg with C=100.000 = 0.66
Test accuracy of L1 logreg with C=100.000 = 0.66
[Figures: L1 logistic regression coefficient plots for different C values]

Although the training and test accuracies show similar values, this model obtained the lowest accuracy so far (66%). There are also no significant differences between the several values used for the regularization parameter (0.001, 1 and 100).

Random Forest¶

Random Forest was the ensemble model implemented.

Below can be observed the confusion matrix obtained, the accuracies for both the train and test sets, and the accuracy after performing cross-validation.

Two methodologies were tested with the Random Forest: one with hyperparameter tuning and one without. Overall, both approaches yielded similar performance.

Even without applying SMOTE, the cross-validation accuracy was around 70%, likely due to the substantial amount of majority-class data and the low density of the minority class.

However, when SMOTE was applied alongside hyperparameter tuning, the cross-validation accuracy improved significantly to 82%, potentially because the model learned more accurate boundaries. As a result, the combination of SMOTE and hyperparameter tuning was adopted for better predictive performance.
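Hyperparameter tuning of a random forest can be done with `GridSearchCV`; the grid below is illustrative, not the one actually used in Models.py:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {                  # illustrative ranges only
    "n_estimators": [50, 100],
    "max_depth": [5, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```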

In [206]:
from Models import *
RandomF(X_smote, y_smote)
Accuracy on training set: 0.8748122625673647
Accuracy on test set: 0.8359395706334481
[Figure: random forest confusion matrix]
Fold:  1, Class dist.: [6791 6791], Acc: 0.639
Fold:  2, Class dist.: [6791 6791], Acc: 0.618
Fold:  3, Class dist.: [6792 6791], Acc: 0.646
Fold:  4, Class dist.: [6792 6791], Acc: 0.797
Fold:  5, Class dist.: [6792 6791], Acc: 0.920
Fold:  6, Class dist.: [6792 6791], Acc: 0.924
Fold:  7, Class dist.: [6791 6792], Acc: 0.922
Fold:  8, Class dist.: [6791 6792], Acc: 0.908
Fold:  9, Class dist.: [6791 6792], Acc: 0.940
Fold: 10, Class dist.: [6791 6792], Acc: 0.908

CV accuracy: 0.822 +/- 0.129

1.2.2 Predicting AdoptionSpeed¶

To predict the adoption speed of the animal we also selected 5 models: Decision Tree, Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest.

Even knowing that results are generally better when the target has fewer classes, we decided to use all 5 classes for these predictions.

Decision Tree¶

When predicting the adoption speed, the Decision Tree was again the first model to be implemented. The smote method did not bring any advantages, so it was not applied.

Below can be observed a representation of the tree model, the accuracies obtained for both the train and test sets, and the accuracy obtained after performing cross-validation. The selected number of leaves in the decision tree was retained for the same reasons outlined previously.

In [211]:
from Models import * 
%matplotlib inline

OurTree(table_X, table_y,20,features)
Accuracy on training set: 0.37995606234961815
Accuracy on test set: 0.356448070285535
Fold:  1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.354
Fold:  2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.372
Fold:  3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.361
Fold:  4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.362
Fold:  5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.365
Fold:  6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.369
Fold:  7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.352
Fold:  8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.372
Fold:  9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.352
Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.342

CV accuracy: 0.360 +/- 0.009
[Figure: decision tree visualization]

Since the training and test accuracies show similar values, it indicates that the model is not memorizing the training set. However, both accuracies are quite low.

Naive Bayes¶

Naive Bayes was the probabilistic algorithm used. Below can be observed the confusion matrices obtained from the train and test sets, together with the accuracies for both sets and the accuracy obtained after performing cross-validation.

In [215]:
from Models import * 
%matplotlib inline
naive(table_X,table_y)
Accuracy on training set: 0.3490950936290407
Accuracy on test set: 0.34860370254157513
Fold:  1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.368
Fold:  2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.356
Fold:  3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.338
Fold:  4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.341
Fold:  5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.340
Fold:  6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.352
Fold:  7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.347
Fold:  8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.347
Fold:  9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.356
Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.330

CV accuracy: 0.348 +/- 0.010
[Figures: Naive Bayes confusion matrices for the train and test sets]

K-Nearest Neighbors (KNN)¶

The number of neighbors was kept at 3 because, although a smaller number could yield better performance, it might also result in worse outcomes in other aspects, such as overfitting.

In [218]:
from Models import * 
%matplotlib inline
X_smote, y_smote,df1 = smoteadoptionspeed(table_X,table_y,features)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.6974068835454974
Accuracy on test set: 0.4724186704384724
Fold:  1, Class dist.: [2546 2545 2545 2545 2545], Acc: 0.458
Fold:  2, Class dist.: [2546 2545 2545 2545 2545], Acc: 0.464
Fold:  3, Class dist.: [2545 2545 2545 2546 2545], Acc: 0.479
Fold:  4, Class dist.: [2545 2545 2545 2546 2545], Acc: 0.472
Fold:  5, Class dist.: [2545 2545 2545 2545 2546], Acc: 0.483
Fold:  6, Class dist.: [2545 2545 2545 2545 2546], Acc: 0.455
Fold:  7, Class dist.: [2545 2545 2546 2545 2545], Acc: 0.460
Fold:  8, Class dist.: [2545 2545 2546 2545 2545], Acc: 0.458
Fold:  9, Class dist.: [2545 2546 2545 2545 2545], Acc: 0.588
Fold: 10, Class dist.: [2545 2546 2545 2545 2545], Acc: 0.630

CV accuracy: 0.495 +/- 0.059
[Figures: KNN confusion matrices for the train and test sets]

The gap between the training and test accuracies suggests that the model might be overfitting.

Support Vector Machine (SVM)¶

The linear model used was the Support Vector Machine, as it is a better model for multiclass classification. The smote method was also used. Below can be observed the accuracies for the train and test sets, the accuracy obtained with cross-validation, and some of the coefficients and intercepts obtained.

In [222]:
from Models import *
svm(table_X,table_y)
Fold:  1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.346
Fold:  2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.357
Fold:  3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.333
Fold:  4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.360
Fold:  5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.344
Fold:  6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.353
Fold:  7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.358
Fold:  8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.363
Fold:  9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.356
Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.340

CV accuracy: 0.351 +/- 0.009
Training set score (Accuracy) = 0.34721205146981904
Test set score (Accuracy) = 0.35707561970505175
------------------------------------------------------------------------------------------
LinearSVC coefficients and intercept:
Coeficients (w) =
 [[ 7.83809898e-06  8.33508544e-06 -2.87790682e-04 -3.26342841e-06
   8.20653586e-06  1.08270637e-05 -3.60710476e-06  1.00720256e-05
   4.25356570e-06  5.35123525e-06  4.19781688e-06  5.09375045e-07
  -8.05500142e-06 -3.74462946e-05 -2.09434799e-05 -3.17319348e-05]
 [ 2.56615961e-05 -2.63347940e-04 -1.13213616e-03 -1.91031185e-05
   4.51061832e-05  3.95016127e-05 -1.10863363e-05  2.16473154e-05
   2.24342478e-05  1.18687157e-05  2.14380418e-05 -1.23812142e-06
  -4.36277045e-05  2.42666816e-05 -6.41633510e-06 -4.47705804e-05]
 [-2.28606850e-07 -1.51550734e-04  1.47097558e-04 -9.75704454e-06
  -1.95939254e-06  6.05786823e-06  5.74039267e-06 -2.74652539e-08
   3.65764078e-06 -5.14964913e-06  1.54960199e-06 -1.44156682e-06
  -6.04014490e-06 -4.86031022e-05 -1.15380660e-05  4.45133056e-05]
 [-2.76209581e-05 -1.27473557e-04 -5.19431452e-04  2.00894240e-05
  -2.54542662e-05 -8.30633723e-06  1.08882801e-05 -1.66911920e-05
  -2.54914157e-05 -2.66017420e-05 -2.20940019e-05 -4.97987744e-07
  -1.15622954e-05 -2.39747800e-04 -1.00676790e-05  3.50999656e-04]
 [-1.88852723e-03  5.30655156e-02  2.29272028e-03  5.26858619e-03
  -7.80421841e-03 -5.24509489e-03 -1.59007336e-03 -3.59996356e-03
  -1.11900935e-03  2.59163550e-03 -3.95055723e-04  5.67413469e-04
   1.50875301e-02 -1.29181731e-06 -2.98093268e-05 -2.38886120e-02]]
Intercept (b) = [-3.44259016e-09 -4.34800533e-09  3.71453007e-09  3.12048471e-08
 -4.27018836e-06]

Random Forest¶

Random Forest was another tree-based model implemented, and another case where the SMOTE method did not bring any advantage. Below are a representation of the confusion matrix obtained, the accuracies for the training and test sets, and the accuracy after performing cross-validation.
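The internals of RandomF() are likewise not shown; a hedged sketch of its likely core, a RandomForestClassifier scored with stratified 10-fold cross-validation, on synthetic stand-in data:

```python
# Sketch of stratified 10-fold CV for a random forest, in the same style
# as the per-fold output printed by the notebook's helper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, n_features=16, n_informative=8,
                           n_classes=5, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

accs = []
skf = StratifiedKFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    rf.fit(X[train_idx], y[train_idx])
    acc = rf.score(X[test_idx], y[test_idx])
    accs.append(acc)
    # class distribution of the training folds, as in the outputs above
    print(f"Fold: {fold:2d}, Class dist.: {np.bincount(y[train_idx])}, Acc: {acc:.3f}")

print(f"CV accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
```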

In [225]:
from Models import *
RandomF(table_X,table_y)
Accuracy on training set: 0.5467099068940265
Accuracy on test set: 0.36774395983683716
(figure output)
Fold:  1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.369
Fold:  2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.373
Fold:  3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.365
Fold:  4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.378
Fold:  5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.374
Fold:  6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.387
Fold:  7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.363
Fold:  8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.370
Fold:  9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.367
Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.341

CV accuracy: 0.369 +/- 0.011

Train specialized models for cats and dogs¶

Adoption Speed¶

For all the models tested, the accuracy was lower when trained specifically on dog data; however, this difference is not significant.

Dogs¶

Tree Model¶
In [229]:
OurTree(table_X_Dogs, table_y_Dogs_Speed,20,features_Dogs)
Accuracy on training set: 0.3840255591054313
Accuracy on test set: 0.34099616858237547
Fold:  1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.360
Fold:  2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.356
Fold:  3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.356
Fold:  4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.358
Fold:  5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.366
Fold:  6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.347
Fold:  7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.329
Fold:  8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.359
Fold:  9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.364
Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.296

CV accuracy: 0.349 +/- 0.020
(figure output)
Naive Bayes¶
In [231]:
naive(table_X_Dogs, table_y_Dogs_Speed)
Accuracy on training set: 0.3226837060702875
Accuracy on test set: 0.3275862068965517
Fold:  1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.329
Fold:  2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.332
Fold:  3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.326
Fold:  4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.335
Fold:  5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.312
Fold:  6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.351
Fold:  7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.324
Fold:  8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.310
Fold:  9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.316
Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.312

CV accuracy: 0.325 +/- 0.012
(figure outputs)

K-Nearest Neighbors (KNN)¶

In [233]:
X_smote, y_smote,df1 = smoteadoptionspeed(table_X_Dogs, table_y_Dogs_Speed,features)
Ourknn(X_smote, y_smote,15)
Accuracy on training set: 0.47716594625070474
Accuracy on test set: 0.3900789177001127
Fold:  1, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.400
Fold:  2, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.417
Fold:  3, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.387
Fold:  4, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.400
Fold:  5, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.385
Fold:  6, Class dist.: [1277 1277 1277 1278 1277], Acc: 0.406
Fold:  7, Class dist.: [1278 1277 1277 1277 1277], Acc: 0.381
Fold:  8, Class dist.: [1277 1278 1277 1277 1277], Acc: 0.426
Fold:  9, Class dist.: [1277 1277 1278 1277 1277], Acc: 0.454
Fold: 10, Class dist.: [1277 1277 1277 1277 1278], Acc: 0.463

CV accuracy: 0.412 +/- 0.027
(figure outputs)
Support Vector Machine (SVM)¶
In [235]:
from Models import *
svm(table_X_Dogs, table_y_Dogs_Speed)
Fold:  1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.351
Fold:  2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.340
Fold:  3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.323
Fold:  4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.351
Fold:  5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.339
Fold:  6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.329
Fold:  7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.355
Fold:  8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.343
Fold:  9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.351
Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.337

CV accuracy: 0.342 +/- 0.010
Training set score (Accuracy) = 0.34930777422790205
Test set score (Accuracy) = 0.3397190293742018
------------------------------------------------------------------------------------------
LinearSVC coefficients and intercept:
Coeficients (w) =
 [[-1.22416050e-08 -1.34356196e-04  7.86186622e-04 -2.15750683e-05
   4.61157641e-05  1.60582951e-04 -8.14339073e-06  1.77672776e-04
   7.02812609e-05  8.65329397e-05  4.91440056e-05  8.49401365e-06
  -1.26833622e-04 -1.28990629e-04 -2.74932261e-05 -3.98255522e-04]
 [ 1.66541256e-06 -2.82153854e-02  8.38782648e-04 -1.06608157e-03
   4.12978226e-03 -2.27556058e-04 -3.69437785e-04  2.47661144e-03
   2.36941090e-03  1.41147346e-03  2.21644910e-03 -1.81429512e-04
  -4.39763437e-03 -2.51547640e-04 -1.45168123e-05 -2.36588781e-03]
 [ 1.59178174e-08 -2.88712089e-04  5.61677806e-05 -3.30079148e-05
   3.79255081e-05 -1.66476627e-05  4.87207969e-06 -1.40517164e-05
  -1.56449842e-05 -3.12143469e-05 -6.81148507e-06 -2.85236030e-06
  -1.48666151e-05  4.08900523e-04 -1.11197778e-05  1.58838249e-04]
 [ 2.10055221e-08 -3.35551925e-05 -1.20144651e-04  1.02999530e-05
  -1.68050506e-05  4.56244270e-05  1.12442128e-06 -1.02557487e-05
  -1.29454699e-05 -2.06220297e-05 -8.01328851e-06  1.39869556e-06
  -5.80312350e-06 -1.71011482e-04 -1.40732744e-05  2.20595811e-04]
 [-1.62399517e-05  5.06338416e-02 -2.72591614e-03  9.43713727e-03
  -1.91296711e-02 -5.76416419e-03 -4.20529594e-03 -7.08743922e-03
   6.78749446e-04  7.42946004e-03 -5.17382953e-04  3.78668032e-04
   2.68022584e-02  1.58670659e-04  2.81364446e-06 -2.84530322e-02]]
Intercept (b) = [-6.12080252e-09  8.32706279e-07  7.95890870e-09  1.05027611e-08
 -8.11997584e-06]

Random Forest¶

In [237]:
RandomF(table_X_Dogs, table_y_Dogs_Speed)
Accuracy on training set: 0.5727369542066028
Accuracy on test set: 0.3314176245210728
(figure output)
Fold:  1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.346
Fold:  2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.379
Fold:  3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.318
Fold:  4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.382
Fold:  5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.324
Fold:  6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.350
Fold:  7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.335
Fold:  8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.356
Fold:  9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.318
Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.319

CV accuracy: 0.343 +/- 0.023

Cats¶

With the exception of the KNN model, the accuracy was marginally higher when trained specifically on cat data.

Tree Model¶
In [240]:
OurTree(table_X_Cats, table_y_Cats_Speed,60,features_Cats)
Accuracy on training set: 0.45794776886695454
Accuracy on test set: 0.3871763255240444
Fold:  1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.387
Fold:  2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.405
Fold:  3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.391
Fold:  4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.385
Fold:  5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.407
Fold:  6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.387
Fold:  7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.387
Fold:  8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.403
Fold:  9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.392
Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.355

CV accuracy: 0.390 +/- 0.014
(figure output)

Naive Bayes¶

In [242]:
naive(table_X_Cats, table_y_Cats_Speed)
Accuracy on training set: 0.36829117828500924
Accuracy on test set: 0.3563501849568434
Fold:  1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.362
Fold:  2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.390
Fold:  3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.367
Fold:  4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.348
Fold:  5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.374
Fold:  6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.355
Fold:  7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.352
Fold:  8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.346
Fold:  9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.380
Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.329

CV accuracy: 0.360 +/- 0.017
(figure outputs)
K-Nearest Neighbors (KNN)¶
In [244]:
X_smote, y_smote,df1 = smoteadoptionspeed(table_X_Cats, table_y_Cats_Speed,features)
Ourknn(X_smote, y_smote,7)
Accuracy on training set: 0.575941230486685
Accuracy on test set: 0.47107438016528924
Fold:  1, Class dist.: [1307 1307 1306 1307 1307], Acc: 0.450
Fold:  2, Class dist.: [1307 1307 1306 1307 1307], Acc: 0.410
Fold:  3, Class dist.: [1307 1307 1307 1306 1307], Acc: 0.448
Fold:  4, Class dist.: [1307 1307 1307 1306 1307], Acc: 0.442
Fold:  5, Class dist.: [1306 1307 1307 1307 1307], Acc: 0.481
Fold:  6, Class dist.: [1306 1307 1307 1307 1307], Acc: 0.457
Fold:  7, Class dist.: [1307 1307 1307 1307 1306], Acc: 0.507
Fold:  8, Class dist.: [1307 1307 1307 1307 1306], Acc: 0.507
Fold:  9, Class dist.: [1307 1306 1307 1307 1307], Acc: 0.507
Fold: 10, Class dist.: [1307 1306 1307 1307 1307], Acc: 0.541

CV accuracy: 0.475 +/- 0.038
(figure outputs)
Random Forest¶
In [246]:
RandomF(table_X_Cats, table_y_Cats_Speed)
Accuracy on training set: 0.5395846185482213
Accuracy on test set: 0.40998766954377314
(figure output)
Fold:  1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.401
Fold:  2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.405
Fold:  3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.396
Fold:  4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.357
Fold:  5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.422
Fold:  6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.403
Fold:  7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.383
Fold:  8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.367
Fold:  9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.418
Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.350

CV accuracy: 0.390 +/- 0.024

Support Vector Machine¶

In [248]:
svm(table_X_Cats, table_y_Cats_Speed)
Fold:  1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.348
Fold:  2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.390
Fold:  3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.368
Fold:  4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.374
Fold:  5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.388
Fold:  6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.380
Fold:  7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.383
Fold:  8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.394
Fold:  9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.392
Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.346

CV accuracy: 0.376 +/- 0.016
Training set score (Accuracy) = 0.3802179724449928
Test set score (Accuracy) = 0.3723797780517879
------------------------------------------------------------------------------------------
LinearSVC coefficients and intercept:
Coeficients (w) =
 [[-1.04201362e-09  1.27213265e-05 -2.91347484e-04 -2.44616858e-06
   2.83742724e-06 -7.04570169e-06 -1.44049270e-06  1.62119180e-06
  -4.05516918e-07  8.45631526e-07  1.60099993e-06  4.22711687e-08
  -2.19728504e-06  1.63707762e-05 -2.13030042e-05 -1.26915196e-05]
 [-3.39425624e-09 -1.19372211e-04 -1.08089254e-03 -1.85033133e-05
   1.77176905e-05  1.24146077e-06 -1.57759737e-06  1.19385312e-05
   7.73008662e-06  3.79555742e-06  1.10586715e-05 -6.62946849e-07
  -2.82407465e-05  2.66218255e-04 -8.47915555e-06 -4.62646848e-05]
 [ 8.54994644e-09 -4.13246863e-04  4.62384527e-05 -1.36414910e-05
  -4.35018487e-05  4.45842229e-05  2.05132027e-05  1.11513845e-05
   3.00469261e-05  5.05590855e-06  1.31747957e-05 -3.85234770e-06
  -1.27152980e-05 -2.15704504e-04 -1.07753684e-05  5.66758958e-05]
 [ 3.63637621e-08 -2.05031311e-04 -5.45870204e-04  2.88789249e-05
  -4.65005108e-06 -8.80755771e-06  6.97760947e-06 -1.84243421e-05
  -2.62858136e-05 -2.29642040e-05 -3.41450099e-05 -2.71040924e-06
  -1.23099939e-05 -2.94367024e-04 -8.54394643e-06  4.07128114e-04]
 [-1.00080305e-05  5.03314166e-02  2.38883129e-03  1.39355875e-02
  -1.10713064e-02 -5.69439717e-04 -5.16182207e-03 -7.76858041e-03
  -1.99970714e-03  3.73530043e-03  2.63345487e-03  1.92951687e-03
   3.86477354e-02  6.43374711e-05 -3.12065343e-05 -2.15073439e-02]]
Intercept (b) = [-1.04201362e-09 -3.39425624e-09  8.54994644e-09  3.63637621e-08
 -1.00080305e-05]

Prediction Adoption¶

Dogs¶

Tree Model¶

In [252]:
OurTree(table_X_Dogs_Adopted, table_y_Dogs_Adopted,20,features_Dogs)
Accuracy on training set: 0.7799787007454739
Accuracy on test set: 0.7535121328224776
Fold:  1, Class dist.: [1366 4268], Acc: 0.764
Fold:  2, Class dist.: [1367 4268], Acc: 0.748
Fold:  3, Class dist.: [1367 4268], Acc: 0.746
Fold:  4, Class dist.: [1366 4269], Acc: 0.748
Fold:  5, Class dist.: [1366 4269], Acc: 0.768
Fold:  6, Class dist.: [1366 4269], Acc: 0.751
Fold:  7, Class dist.: [1366 4269], Acc: 0.772
Fold:  8, Class dist.: [1366 4269], Acc: 0.744
Fold:  9, Class dist.: [1366 4269], Acc: 0.762
Fold: 10, Class dist.: [1366 4269], Acc: 0.751

CV accuracy: 0.755 +/- 0.010
(figure output)

Naive Bayes¶

In [254]:
naive(table_X_Dogs_Adopted, table_y_Dogs_Adopted)
Accuracy on training set: 0.7356762513312034
Accuracy on test set: 0.7113665389527458
Fold:  1, Class dist.: [1366 4268], Acc: 0.727
Fold:  2, Class dist.: [1367 4268], Acc: 0.727
Fold:  3, Class dist.: [1367 4268], Acc: 0.712
Fold:  4, Class dist.: [1366 4269], Acc: 0.724
Fold:  5, Class dist.: [1366 4269], Acc: 0.733
Fold:  6, Class dist.: [1366 4269], Acc: 0.716
Fold:  7, Class dist.: [1366 4269], Acc: 0.730
Fold:  8, Class dist.: [1366 4269], Acc: 0.735
Fold:  9, Class dist.: [1366 4269], Acc: 0.709
Fold: 10, Class dist.: [1366 4269], Acc: 0.725

CV accuracy: 0.724 +/- 0.008
(figure outputs)
K-Nearest Neighbors (KNN)¶
In [256]:
X_smote, y_smote,df1 = smoteadopted(table_X_Dogs_Adopted, table_y_Dogs_Adopted,features)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.8839128907622058
Accuracy on test set: 0.7771338250790305
Fold:  1, Class dist.: [3416 3416], Acc: 0.759
Fold:  2, Class dist.: [3416 3416], Acc: 0.792
Fold:  3, Class dist.: [3416 3417], Acc: 0.751
Fold:  4, Class dist.: [3416 3417], Acc: 0.829
Fold:  5, Class dist.: [3416 3417], Acc: 0.808
Fold:  6, Class dist.: [3416 3417], Acc: 0.808
Fold:  7, Class dist.: [3417 3416], Acc: 0.827
Fold:  8, Class dist.: [3417 3416], Acc: 0.819
Fold:  9, Class dist.: [3417 3416], Acc: 0.809
Fold: 10, Class dist.: [3417 3416], Acc: 0.814

CV accuracy: 0.802 +/- 0.025
(figure outputs)

Support Vector Machine¶

In [258]:
svm(table_X_Dogs_Adopted, table_y_Dogs_Adopted)
Fold:  1, Class dist.: [1366 4268], Acc: 0.745
Fold:  2, Class dist.: [1367 4268], Acc: 0.748
Fold:  3, Class dist.: [1367 4268], Acc: 0.759
Fold:  4, Class dist.: [1366 4269], Acc: 0.751
Fold:  5, Class dist.: [1366 4269], Acc: 0.746
Fold:  6, Class dist.: [1366 4269], Acc: 0.744
Fold:  7, Class dist.: [1366 4269], Acc: 0.752
Fold:  8, Class dist.: [1366 4269], Acc: 0.757
Fold:  9, Class dist.: [1366 4269], Acc: 0.748
Fold: 10, Class dist.: [1366 4269], Acc: 0.749

CV accuracy: 0.750 +/- 0.005
Training set score (Accuracy) = 0.7559105431309904
Test set score (Accuracy) = 0.7369093231162197
------------------------------------------------------------------------------------------
LinearSVC coefficients and intercept:
Coeficients (w) =
 [[ 1.66029634e-05 -5.06078236e-02  2.69648567e-03 -9.56303762e-03
   1.93612794e-02  5.97777858e-03  4.32642784e-03  7.22599565e-03
  -7.17715463e-04 -7.54557136e-03  5.17667854e-04 -3.86695506e-04
  -2.71828312e-02 -1.62068346e-04 -2.63912997e-06  2.85086170e-02]]
Intercept (b) = [8.30148172e-06]

Random Forest¶

In [260]:
X_smote, y_smote,df1 = smoteadopted(table_X_Dogs_Adopted, table_y_Dogs_Adopted,features)
RandomF(X_smote, y_smote)
Accuracy on training set: 0.8912890762205831
Accuracy on test set: 0.8414120126448894
(figure output)
Fold:  1, Class dist.: [3416 3416], Acc: 0.605
Fold:  2, Class dist.: [3416 3416], Acc: 0.625
Fold:  3, Class dist.: [3416 3417], Acc: 0.606
Fold:  4, Class dist.: [3416 3417], Acc: 0.885
Fold:  5, Class dist.: [3416 3417], Acc: 0.930
Fold:  6, Class dist.: [3416 3417], Acc: 0.928
Fold:  7, Class dist.: [3417 3416], Acc: 0.935
Fold:  8, Class dist.: [3417 3416], Acc: 0.928
Fold:  9, Class dist.: [3417 3416], Acc: 0.935
Fold: 10, Class dist.: [3417 3416], Acc: 0.946

CV accuracy: 0.832 +/- 0.145

Cats¶

Tree Model¶

In [263]:
OurTree(table_X_Cats_Adopted, table_y_Cats_Adopted,20,features_Dogs)
Accuracy on training set: 0.7645486325313593
Accuracy on test set: 0.7552404438964242
Fold:  1, Class dist.: [1614 4222], Acc: 0.761
Fold:  2, Class dist.: [1614 4222], Acc: 0.763
Fold:  3, Class dist.: [1613 4223], Acc: 0.750
Fold:  4, Class dist.: [1613 4223], Acc: 0.747
Fold:  5, Class dist.: [1613 4223], Acc: 0.755
Fold:  6, Class dist.: [1614 4223], Acc: 0.764
Fold:  7, Class dist.: [1614 4223], Acc: 0.752
Fold:  8, Class dist.: [1614 4223], Acc: 0.761
Fold:  9, Class dist.: [1614 4223], Acc: 0.762
Fold: 10, Class dist.: [1614 4223], Acc: 0.727

CV accuracy: 0.754 +/- 0.011
(figure output)

Naive Bayes¶

In [265]:
naive(table_X_Cats_Adopted, table_y_Cats_Adopted)
Accuracy on training set: 0.7149907464528069
Accuracy on test set: 0.7114673242909988
Fold:  1, Class dist.: [1614 4222], Acc: 0.689
Fold:  2, Class dist.: [1614 4222], Acc: 0.737
Fold:  3, Class dist.: [1613 4223], Acc: 0.712
Fold:  4, Class dist.: [1613 4223], Acc: 0.700
Fold:  5, Class dist.: [1613 4223], Acc: 0.723
Fold:  6, Class dist.: [1614 4223], Acc: 0.711
Fold:  7, Class dist.: [1614 4223], Acc: 0.704
Fold:  8, Class dist.: [1614 4223], Acc: 0.718
Fold:  9, Class dist.: [1614 4223], Acc: 0.741
Fold: 10, Class dist.: [1614 4223], Acc: 0.704

CV accuracy: 0.714 +/- 0.015
(figure outputs)
K-Nearest Neighbors (KNN)¶
In [267]:
X_smote, y_smote,df1 = smoteadopted(table_X_Cats_Adopted, table_y_Cats_Adopted,features)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.8807061340941512
Accuracy on test set: 0.7561497326203208
Fold:  1, Class dist.: [3365 3365], Acc: 0.759
Fold:  2, Class dist.: [3365 3365], Acc: 0.737
Fold:  3, Class dist.: [3365 3365], Acc: 0.761
Fold:  4, Class dist.: [3365 3365], Acc: 0.758
Fold:  5, Class dist.: [3365 3365], Acc: 0.814
Fold:  6, Class dist.: [3365 3365], Acc: 0.824
Fold:  7, Class dist.: [3365 3365], Acc: 0.802
Fold:  8, Class dist.: [3365 3365], Acc: 0.807
Fold:  9, Class dist.: [3366 3365], Acc: 0.803
Fold: 10, Class dist.: [3365 3366], Acc: 0.807

CV accuracy: 0.787 +/- 0.029
(figure outputs)

Support Vector Machine¶

In [269]:
svm(table_X_Cats_Adopted, table_y_Cats_Adopted)
Fold:  1, Class dist.: [1614 4222], Acc: 0.737
Fold:  2, Class dist.: [1614 4222], Acc: 0.741
Fold:  3, Class dist.: [1613 4223], Acc: 0.729
Fold:  4, Class dist.: [1613 4223], Acc: 0.741
Fold:  5, Class dist.: [1613 4223], Acc: 0.733
Fold:  6, Class dist.: [1614 4223], Acc: 0.739
Fold:  7, Class dist.: [1614 4223], Acc: 0.715
Fold:  8, Class dist.: [1614 4223], Acc: 0.730
Fold:  9, Class dist.: [1614 4223], Acc: 0.739
Fold: 10, Class dist.: [1614 4223], Acc: 0.721

CV accuracy: 0.732 +/- 0.009
Training set score (Accuracy) = 0.7234217561176228
Test set score (Accuracy) = 0.7355117139334155
------------------------------------------------------------------------------------------
LinearSVC coefficients and intercept:
Coeficients (w) =
 [[ 1.60397922e-06 -3.17046092e-02 -2.12355930e-03 -1.60970449e-03
   4.44005663e-04  5.34751741e-04  8.35971566e-04  1.04477003e-03
   7.49589370e-04 -4.12631690e-04  2.44926785e-04 -3.27439410e-04
  -4.38366944e-03 -7.95952693e-05  2.73017726e-05  1.11212790e-02]]
Intercept (b) = [1.60397922e-06]

Random Forest¶

In [271]:
X_smote, y_smote,df1 = smoteadopted(table_X_Cats_Adopted, table_y_Cats_Adopted,features)
RandomF(X_smote, y_smote)
Accuracy on training set: 0.8610912981455064
Accuracy on test set: 0.8042780748663102
(figure output)
Fold:  1, Class dist.: [3365 3365], Acc: 0.659
Fold:  2, Class dist.: [3365 3365], Acc: 0.690
Fold:  3, Class dist.: [3365 3365], Acc: 0.650
Fold:  4, Class dist.: [3365 3365], Acc: 0.698
Fold:  5, Class dist.: [3365 3365], Acc: 0.898
Fold:  6, Class dist.: [3365 3365], Acc: 0.914
Fold:  7, Class dist.: [3365 3365], Acc: 0.902
Fold:  8, Class dist.: [3365 3365], Acc: 0.897
Fold:  9, Class dist.: [3366 3365], Acc: 0.902
Fold: 10, Class dist.: [3365 3366], Acc: 0.909

CV accuracy: 0.812 +/- 0.113

1.3. Classification - Final Discussion and Conclusions¶

The models generally demonstrated similar performance across the same tasks. As expected, we observed poorer performance in multiclass prediction tasks compared to binary classification, given the increased complexity of the former. It is evident that SMOTE contributed to enhancing the prediction of class 0.

It is also noteworthy that, after applying SMOTE, only some models exhibited improved performance. This could be attributed to the way each model operates. For instance, in the case of KNN, if synthetic data points were generated close to existing points, this could lead to improved predictions.
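For intuition on why SMOTE can help distance-based models such as KNN: each synthetic minority sample is an interpolation between a minority point and one of its minority-class nearest neighbours, so new points land close to existing ones. A hand-rolled toy illustration of that idea (the project itself uses a SMOTE helper, not this code):

```python
# Toy illustration of SMOTE's core idea: synthetic samples interpolated
# between a minority point and a randomly chosen minority-class neighbour.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))  # 20 minority points

nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, idx = nn.kneighbors(minority)  # first neighbour of each point is itself

synthetic = []
for i in range(len(minority)):
    j = rng.choice(idx[i][1:])  # pick a real neighbour, skipping the point itself
    gap = rng.random()          # interpolation factor in [0, 1)
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)

print("Synthetic samples:", synthetic.shape)
```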

Overall, the accuracy for predicting adoption or non-adoption was approximately 80% in some models and around 75% in others, indicating consistent results. However, when predicting the speed of adoption, most models underperformed. This could be due to the ratio between the number of features and the number of rows: 16 features were used against a dataset of around 5,000 rows, a disparity that may compromise the viability of multiclass predictions.

In general, when reducing the number of classes, the accuracy on AdoptionSpeed increases from around 30% to approximately 55%. This demonstrates that most classifiers are not particularly effective at distinguishing multiple classes; one reason for this can be the overlapping of points across classes, that is, data with similar features but different labels.

In future analyses, one-hot encoding could potentially improve results by spreading the categorical information over more columns and, consequently, reducing noise-related issues: in the current encoding, noise in a single column can disproportionately impact results, contributing to the failure of multiclass predictions.
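The suggested one-hot encoding can be done with pandas; the column names below are illustrative, not the dataset's actual ones:

```python
# Sketch of one-hot encoding categorical columns with pandas.get_dummies;
# "Type" and "Color1" stand in for whichever categorical features apply.
import pandas as pd

df = pd.DataFrame({"Type": ["Dog", "Cat", "Dog"],
                   "Color1": ["Black", "White", "Brown"]})
encoded = pd.get_dummies(df, columns=["Type", "Color1"])
print(list(encoded.columns))  # one indicator column per category value
```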

Prediction Adoption

| Model | Training ACC | Test ACC | Cross-Validation ACC |
|---|---|---|---|
| Tree | 0.755 | 0.751 | 0.755 |
| Naive Bayes | 0.719 | 0.729 | 0.720 |
| KNN | 0.881 | 0.771 | 0.788 |
| Logistic Regression | 0.660 | 0.658 | 0.659 |
| Random Forest | 0.874 | 0.835 | 0.822 |

Prediction Adoption Speed

| Model | Training ACC | Test ACC | Cross-Validation ACC |
|---|---|---|---|
| Tree | 0.379 | 0.356 | 0.360 |
| Naive Bayes | 0.349 | 0.348 | 0.348 |
| KNN | 0.697 | 0.472 | 0.495 |
| SVM | 0.347 | 0.357 | 0.351 |
| Random Forest | 0.546 | 0.367 | 0.369 |

In general, when using the same classifiers to predict the adoption of cats and dogs separately, there were some differences. These can be due to greater similarity within the cat data, leading to better results in models that rely on distances or similarity, as seen with KNN and Random Forest. In conclusion, however, the differences observed are not significant, as they fall within the margin of error, and the results match those obtained when predicting over all animals.

When predicting adoption, the values are very similar. This could be due to the lower number of classes, which makes it easier to predict the classes and results in better performance.

Prediction Adoption Speed¶

Dogs¶

| Model | Training ACC | Test ACC | Cross-Validation ACC |
|---|---|---|---|
| Tree | 0.384 | 0.340 | 0.349 |
| Naive Bayes | 0.322 | 0.327 | 0.325 |
| KNN | 0.477 | 0.390 | 0.412 |
| SVM | 0.349 | 0.339 | 0.342 |
| Random Forest | 0.572 | 0.331 | 0.343 |

Cats¶

| Model | Training ACC | Test ACC | Cross-Validation ACC |
|---|---|---|---|
| Tree | 0.457 | 0.387 | 0.390 |
| Naive Bayes | 0.368 | 0.356 | 0.360 |
| KNN | 0.575 | 0.471 | 0.475 |
| SVM | 0.380 | 0.372 | 0.376 |
| Random Forest | 0.539 | 0.409 | 0.390 |

Prediction Adoption¶

Dogs¶

| Model | Training ACC | Test ACC | Cross-Validation ACC |
|---|---|---|---|
| Tree | 0.779 | 0.753 | 0.755 |
| Naive Bayes | 0.735 | 0.711 | 0.724 |
| KNN | 0.883 | 0.777 | 0.802 |
| SVM | 0.755 | 0.736 | 0.750 |
| Random Forest | 0.891 | 0.841 | 0.832 |

Cats¶

| Model | Training ACC | Test ACC | Cross-Validation ACC |
|---|---|---|---|
| Tree | 0.764 | 0.755 | 0.754 |
| Naive Bayes | 0.714 | 0.711 | 0.714 |
| KNN | 0.880 | 0.756 | 0.787 |
| SVM | 0.723 | 0.735 | 0.732 |
| Random Forest | 0.861 | 0.804 | 0.812 |

Task 2 (Unsupervised Learning) - Characterizing Pets and their Adoption Speed¶

In this task you should use unsupervised learning to characterize pets and their adoption speed. You have 2 clustering tasks:

  1. Use Clustering algorithms to find similar groups of adopted pets. When animals are adopted, is it possible to find groups of pets with the same/similar adoption speed? Evaluate clustering results using internal and external metrics.
  2. Be creative and define and explore your own unsupervised learning task! What else would it be interesting to find out?

2.1. Preprocessing Data for Clustering¶

To complete this part of the work, the dataset was first scaled. Given that K-Means relies on distance-based methods, it was essential to control the scale of the variables.
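The exact transformation inside loadScaledData is not shown; a StandardScaler, one common choice for distance-based methods such as K-Means, would work like this minimal sketch (the array X is illustrative):

```python
# Sketch of feature scaling before K-Means: StandardScaler centres each
# feature at 0 with unit variance, so no single feature dominates distances.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 per feature
print(X_scaled.std(axis=0))   # 1 per feature
```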

In [278]:
from LoadingData import *
from Models import *
table_X, table_y, features, target_name, df = load_data('PetFinder_dataset.csv')
%matplotlib inline
table_X_Scaled, table_y_Scaled, features_Scaled, target_name_Scaled, df_Scaled = loadScaledData(df)

2.2. Learning and Evaluating Clusterings¶

To explore potential similarities among the animals, clustering was performed using 2, 4, and 8 clusters. The aim was to determine whether the data could be effectively divided by factors such as the type of animal, adoption speed, or adoption speed based on the animal type.

The silhouette score evaluates how well each data point aligns with its assigned cluster. Scores range from -1 (worst) to 1 (best), with values near 0 indicating overlapping clusters.

The ARI (Adjusted Rand Index) ranges from -1 (completely dissimilar) to 1 (perfect match), with 0 indicating random clustering.
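Both metrics are available in scikit-learn; the sketch below computes them on toy blob data (OurKmeans is assumed to do something equivalent internally):

```python
# Silhouette (internal metric: cohesion vs. separation) and ARI
# (external metric: agreement with known labels) for a K-Means fit.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)          # needs only X and the labels
ari = adjusted_rand_score(y_true, labels)  # needs ground-truth labels
print(f"silhouette: {sil:.3f}, ARI: {ari:.3f}")
```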

Clustering¶

In [282]:
i=2
print("Number of Clusters ", i ," \n")
OurKmeans(table_X_Scaled, table_y_Scaled,i)
Number of Clusters  2  

Kmeans silhouette_score: 0.0980206208981116
Kmeans Adjusted Rand Index (ARI): 0.0031734913631197123
HCA silhouette_score: 0.12665142552829103
HCA Adjusted Rand Index (ARI): 0.0007490641101930682
(figure outputs)
In [283]:
i=4
print("Number of Clusters ", i ," \n")
OurKmeans(table_X_Scaled, table_y_Scaled,i)
Number of Clusters  4  

Kmeans silhouette_score: 0.09880751324918044
Kmeans Adjusted Rand Index (ARI): 0.004793905554401137
HCA silhouette_score: 0.12048475375960427
HCA Adjusted Rand Index (ARI): 0.003456109079462546
(figure outputs)
In [284]:
i=8
print("Number of Clusters ", i ," \n")
OurKmeans(table_X_Scaled, table_y_Scaled,i)
Number of Clusters  8  

Kmeans silhouette_score: 0.10453662380484792
Kmeans Adjusted Rand Index (ARI): 0.009729493678839164
HCA silhouette_score: 0.06667274333941256
HCA Adjusted Rand Index (ARI): 0.010731271596755989
(figure outputs)
| Model | Nº Clusters | Silhouette Score | ARI |
|---|---|---|---|
| K-Means | 2 | 0.098 | 0.003 |
| K-Means | 4 | 0.098 | 0.004 |
| K-Means | 8 | 0.104 | 0.009 |
| HCA | 2 | 0.126 | 0.0007 |
| HCA | 4 | 0.120 | 0.003 |
| HCA | 8 | 0.066 | 0.010 |

As observed, the silhouette scores are generally close to 0, indicating substantial overlap between clusters. Meanwhile, the ARI shows that the clusters are almost random with respect to the true labels, likely due to overlap in the most significant components.

To determine if similar clusters exhibited comparable speeds, heatmaps were generated to visualize this information. Across all heatmaps, it was evident that each cluster contained a range of values for various speeds. This demonstrates that no clear separation rule exists when using speeds. Consequently, graphs with the principal components and the clusters were created to investigate this further.
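The data behind such a heatmap can be built as a contingency table of cluster labels against AdoptionSpeed; a toy sketch with made-up labels (the real labels come from the fitted clustering models):

```python
# Contingency table of cluster label vs. adoption speed; rendering it with
# sns.heatmap(ct, annot=True) would reproduce the notebook's heatmaps.
import pandas as pd

clusters = [0, 0, 1, 1, 1, 0, 1, 0]   # illustrative cluster assignments
speed    = [1, 2, 1, 4, 4, 2, 1, 3]   # illustrative AdoptionSpeed values
ct = pd.crosstab(pd.Series(clusters, name="cluster"),
                 pd.Series(speed, name="AdoptionSpeed"))
print(ct)
```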

Next, biclustering appeared to be an interesting avenue to explore. Using PCA, visualizations were created for both the K-Means and Co-Clustering methods. Visually, the clusterings with 2 or 4 groups seem coherent, whereas the 8-cluster scenario is a clear example of overfitting, as can be verified in the images below.

Biclustering¶

In [288]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, SpectralCoclustering
import matplotlib.pyplot as plt
import seaborn as sns

table_X, table_y, features, target_name, df = load_data('PetFinder_dataset.csv')
table_X_Scaled, table_y_Scaled, features_Scaled, target_name_Scaled, df_Scaled = loadScaledData(df)
df_Scaled.to_excel("output_file.xlsx", index=False)

def plot_pca_with_clusters(X_scaled, num_clusters, df, ax):
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    kmeans.fit(X_scaled)
    cluster_labels = kmeans.labels_
    pca = PCA(n_components=2)  
    pca_result = pca.fit_transform(X_scaled)
    
    sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=cluster_labels, palette="Set1", s=10, edgecolor='black', ax=ax)
    ax.set_title(f"PCA of KMeans Clustering (Clusters: {num_clusters})")
    ax.set_xlabel("PCA Component 1")
    ax.set_ylabel("PCA Component 2")
    ax.legend(title="Cluster")

def plotBicluster(table_X, nclusters, ax):
    clustering = SpectralCoclustering(n_clusters=nclusters, random_state=0)
    clustering.fit(table_X)
    row_labels = clustering.row_labels_
    
    pca = PCA(n_components=2)
    table_X_PCA = pca.fit_transform(table_X)
    
    sns.scatterplot(x=table_X_PCA[:, 0], y=table_X_PCA[:, 1], hue=row_labels, palette="Set1", s=10, edgecolor='black', ax=ax)
    ax.set_title(f"PCA of Spectral Coclustering (Clusters: {nclusters})")
    ax.set_xlabel("PCA Component 1")
    ax.set_ylabel("PCA Component 2")
    ax.legend(title="Cluster")

for num_clusters in [2, 4, 8]:
    print(f"\nNumber of Clusters: {num_clusters}")

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    plot_pca_with_clusters(table_X_Scaled, num_clusters, df, axes[0])  
    plotBicluster(table_X_Scaled, num_clusters, axes[1])  
    
    plt.tight_layout() 
    plt.show()
Number of Clusters: 2
[Figure: K-Means clusters (left) and Spectral Co-clustering row labels (right) projected onto the first two principal components, k=2]

Number of Clusters: 4
[Figure: same pair of PCA projections, k=4]

Number of Clusters: 8
[Figure: same pair of PCA projections, k=8]

Overfitting can be observed in the images above through the significant overlap of clusters, both in clustering and in biclustering. This indicates that no clearly defined groups exist when considering only the two principal components. In K-Means, overfitting becomes evident as soon as the number of clusters reaches 4, whereas in Co-Clustering the groups exhibit less overlap, resulting in a clearer visualization.

That said, the visualization with 2 clusters is better in K-Means, as points that are closer together generally tend to be more similar.

In [290]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralCoclustering

def plotBiclusterHeatmap(table_X, nclusters, ax1):
    # Fit spectral co-clustering on the full scaled table
    clustering = SpectralCoclustering(n_clusters=nclusters, random_state=0)
    clustering.fit(table_X)

    # Reorder rows and columns so that members of the same bicluster are adjacent,
    # which makes the block structure visible in the heatmap
    row_order = np.argsort(clustering.row_labels_)
    col_order = np.argsort(clustering.column_labels_)
    table_X_reordered = table_X.iloc[row_order, col_order]

    sns.heatmap(table_X_reordered, cmap='twilight_shifted', ax=ax1, cbar=True)
    ax1.set_title(f"Bicluster Heatmap (Clusters: {nclusters})")
    ax1.set_xlabel("Reordered Columns")
    ax1.set_ylabel("Reordered Rows")

fig, ax1 = plt.subplots(1, 1, figsize=(10, 8))
plotBiclusterHeatmap(table_X=df_Scaled, nclusters=5, ax1=ax1)
plt.tight_layout()
plt.show()
[Figure: heatmap of df_Scaled with rows and columns reordered by their bicluster labels, 5 biclusters]

The best biclustering result was obtained with 5 clusters, as it produced five well-defined groups, each containing a specific set of features.

  • Cluster 1 included the features: Dewormed, Vaccinated, and Sterilized.
  • Cluster 2 contained the features: Gender, PhotoAmt, and Quantity.
  • Cluster 3 included: Breed1, MaturitySize, FurLength, and Health.
  • Cluster 4 contained: Type, Color2, and State.
  • Cluster 5 included: Age, Color1, and Fee.

We believe Cluster 1 makes sense, as each of its features pertains to a medical procedure. Cluster 2, which includes PhotoAmt, Gender, and Quantity, also seems logical, as all of these features relate to how the animal is presented in the listing. Cluster 3 contains Breed1, which normally determines MaturitySize and FurLength, while Health may also influence maturity and fur length. Cluster 4 is more challenging to interpret, as it is harder to find a clear correlation between its features. Cluster 5, however, appears reasonable when considering that Age and Color1 might be correlated, and the Fee feature likely reflects this as well: older animals may not require a fee.
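The per-bicluster feature lists above can be read directly off the model's `column_labels_` attribute. The sketch below illustrates this on a hypothetical DataFrame that reuses the dataset's column names; the random values are only a stand-in for `df_Scaled`, so the groupings printed here will not match the ones reported above:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
cols = ["Dewormed", "Vaccinated", "Sterilized", "Gender", "PhotoAmt",
        "Quantity", "Breed1", "MaturitySize", "FurLength", "Health",
        "Type", "Color2", "State", "Age", "Color1", "Fee"]
df = pd.DataFrame(rng.random((200, len(cols))), columns=cols)  # stand-in for df_Scaled

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(df.values)

# Group the column names by the bicluster each column was assigned to
for k in range(5):
    members = [c for c, lab in zip(cols, model.column_labels_) if lab == k]
    print(f"Bicluster {k}: {members}")
```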

2.3. Clustering - Final Discussion and Conclusions¶

In general, when using K-Means and HCA, our results did not indicate clusters with similar adoption times. Instead, the clusters exhibited highly heterogeneous adoption times, making it difficult to draw clear conclusions. When plotting the data in the two principal components, we observed significant overlapping of data points, which suggests that the results may be influenced by this overlap. This could have impacted the ability of the clustering algorithms to distinguish distinct patterns in adoption times.

Our data and results indicated weak correlation within clusters, which appeared to be random and very similar to each other.

Given these challenges, we decided to explore biclustering as an alternative approach. The goal was to see if we could form better-defined clusters, identify the features associated with each cluster, and assess whether these groupings made sense in the context of adoption time.

The biclusters obtained generally made intuitive sense, as each grouped related features together, which suggests that biclustering is a promising method for segmenting this data. This approach allows for more meaningful grouping without focusing on adoption speed, providing a better way to uncover patterns inherent in the data.

3. Final Comments and Conclusions¶

In the first part of the project we achieved good results when attempting to predict whether a pet would be adopted (or not).

In the multiclass prediction, we obtained 30% accuracy, which aligns with expectations, although we know the model could perform better when using fewer classes. Creating a model with good performance is a challenge in multiclass prediction.

A major contributing factor to this is the imbalance between the number of features and the number of rows (16 features x 5000 rows). To address class imbalance, we applied SMOTE to oversample the minority classes, which avoids removing data and maintains variance in the dataset. These results suggest that there is room for improvement by adjusting the features and possibly using one-hot encoding to further enhance the data preprocessing step.

Overall, the best results were consistently achieved with KNN and Random Forest. This outcome is consistent with the nature of the data, as the data points share similarities, and both models are effective at identifying and leveraging these similarities for classification. Given the characteristics of the data, these models are better suited to predict animal adoption rather than the speed at which they will be adopted.

In terms of clustering, the groups were not effectively clustered by adoption speed, as verified by the analysis, and overall, the clustering performance was poor. This indicates that clustering methods may not be ideal for this dataset, and further adjustments or alternative approaches may be required to improve the results.